https://cimentadaj.github.io/blog/2020-02-06-the-simplest-tidy-machine-learning-workflow/the-simplest-tidy-machine-learning-workflow/
This notebook motivates the use of tidymodels as a way to simplify the process of building and evaluating models. While it is entirely possible to use Base R methods to organize data and build models, the tidyverse and tidymodels provide a uniform interface for interacting with data and assembling models.
What I hope you get out of this includes the following:
Predictive Modeling is a type of Machine Learning, which itself is a sub-branch of Artificial Intelligence. The following graphic provides some history of these domains, which is helpful if you are trying to orient yourself in the world of analytics and machine learning. Note that AI has been around for quite some time. The Wikipedia definition of AI is:
The study of “intelligent agents”: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals
I found the following image on an Intel web page but can’t remember exactly where.
Machine Learning relies upon “patterns and inference” to “perform a specific task without using explicit instructions”. It is a form of Applied AI that attempts to automatically learn from experience without being explicitly programmed. Think of Predictive Modeling as a subset of this which falls into two categories:
Supervised
Algorithms that build a model on a set of data containing both the inputs and the desired outputs (“labels” or known numeric values). When you want to map input to known output labels. Build a model that, when applied to “new” data, will hopefully predict the correct label.
Unsupervised
Algorithms that take a set of data that contains only inputs, and find structure in the data (e.g. clustering of data points).
This lecture is concerned primarily with Predictive Modeling. Some examples of Predictive Modeling include:
Predict current CD4 cell count of an HIV-positive patient using genome sequences
Predict Success of Grant Applications
Use attributes of chemical compounds to predict likelihood of hepatic injury
How many copies of a new book will sell?
Will a customer change Internet Service Providers?
A typical workflow in support of predictive modeling might look like this:
In-Sample vs Out-Of-Sample Error
The goal of predictive modeling is to generate models that generalize to new data. It would be good if any model we generate could provide a good estimate of out-of-sample error. It’s easy to build a model on an entire data set (the in-sample data) and then turn around and use that same data for prediction. But how will it perform on new data? Haven’t we just overtrained our model?
Performance Metrics
For either case (regression vs classification) we need some type of metric to tell us how well a given model will work on new or unseen data - also known as “out of sample” data. For classification problems we look at measures like “sensitivity”, “specificity”, “accuracy”, and “Area Under the Curve”.
For quantitative outcomes, we look at measures like Root Mean Square Error (RMSE) or Mean Absolute Error (MAE). Here is the formula for Root Mean Square Error (RMSE), where P represents a vector of predictions and O represents a vector of the observed (true) values.
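In symbols, with n paired predictions and observations:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right|
```

MAE simply replaces the squared differences with absolute differences, which makes it less sensitive to large individual errors.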
While “old” sounds somewhat negative, there is absolutely nothing wrong with using an established approach to model data using Base R. This involves applying specific packages to build, for example, a logistic regression model, a decision tree, or a support vector machine.
The job involves identifying the appropriate package(s) and then dividing the data up into training and testing pairs, after which a function is called to do the work.
At a minimum, one must specify 1) the training data set and 2) a formula which indicates what the target and predictor variables will be. Then you look at the result.
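As a minimal sketch of that pattern (using the built-in mtcars data purely for illustration):

```r
# Fit a linear model: mpg as a function of weight, using the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)

# Look at the result: intercept and slope estimates
coef(fit)
```

The same shape - a formula plus a data argument - recurs across Base R modeling functions such as glm, which we use below.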
To get the ball rolling with a practical case, let’s run through an example using the well-known Pima Indians dataset to predict whether a given person tests positive or negative for diabetes. First, read in a copy of the data; afterwards we will chop it up into a training and test pair.
url <- "https://raw.githubusercontent.com/steviep42/bios534_spring_2020/master/data/pima.csv"
pm <- read.csv(url)
head(pm)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 0 0 25.6 0.201 30 neg
The description of the data set is as follows:
In predictive modeling we have some important terminology:
corrplot::corrplot(cor(pm[,-9]))
Look at some differences between the positive vs negative cases:
myt <- table(pm$diabetes)
barplot(myt,
ylim=c(0,600),
main="Pima Indians - Count of Cases",
names.arg=names(myt),
ylab="Case Count",
xlab="Cases")
grid()
boxplot(glucose~diabetes,
data=pm,
horizontal = TRUE,
col="aquamarine",
main="Distribution of Glucose per Group")
grid()
So let’s plot Glucose vs Mass to see if any interesting hypotheses come to mind (or not):
# Split the data on the target variable which is diabetes
mysplits <- split(pm,pm$diabetes)
# Create a plot window with appropriate dimensions
plot(pm$mass,pm$glucose,type="n",
main="Glucose To Mass Relationship",
xlab="Mass",
ylab="Glucose",
sub="Data from Pima Indians Data")
# Now put up the mass vs glucose for each group (pos and neg)
cols <- c("green","red")
for (ii in 1:length(mysplits)) {
points(mysplits[[ii]]$mass,
mysplits[[ii]]$glucose,
col=cols[ii],pch=18)
}
# Draw a grid
grid()
# Put up a legend
legend("topright",
legend=names(mysplits),
pch=18,
col=cols,title="Group")
Check the data using a boxplot:
boxplot(mass~diabetes,
data=pm,
horizontal = TRUE,
col="aquamarine",
main="Distribution of Mass per Group")
grid()
At this point we want to build a model to predict whether a given observation will be positive for diabetes (or negative).
What we will do is create a training and testing data set using the sample function. Here we will carve out roughly 70% of the data for training and 30% for testing.
set.seed(123) # Makes this example reproducible
nrows <- nrow(pm)
propo <- 0.70
(idx <- sample(1:nrows,propo*nrows))
## [1] 415 463 179 526 195 118 299 229 244 14 374 665 602 603 709 91 348 649
## [19] 355 26 519 426 751 211 590 593 555 373 143 544 490 621 23 309 135 224
## [37] 166 217 290 581 72 588 575 141 722 153 294 277 767 41 431 90 316 223
## [55] 528 116 606 456 598 39 159 209 758 34 516 13 69 409 308 278 89 537
## [73] 291 424 286 671 121 110 158 64 483 477 480 67 663 85 165 648 51 74
## [91] 178 362 236 610 330 127 212 310 243 113 619 687 151 614 160 391 155 747
## [109] 5 326 280 567 238 339 754 137 455 560 589 83 654 196 694 712 500 344
## [127] 693 459 20 764 164 52 534 177 554 84 523 392 302 597 668 650 430 428
## [145] 250 429 398 714 381 545 40 522 473 200 125 265 186 573 252 458 152 54
## [163] 538 235 289 185 413 617 735 607 205 697 564 684 701 346 664 468 509 57
## [181] 457 357 279 270 347 129 218 337 749 539 748 553 390 498 222 421 627 163
## [199] 656 704 674 225 389 117 629 55 731 557 768 134 447 104 591 210 349 401
## [217] 258 635 386 725 24 466 130 682 377 170 445 234 422 508 689 80 688 475
## [235] 696 343 323 479 450 111 317 574 287 292 226 297 681 237 628 33 746 396
## [253] 721 707 76 94 636 30 562 434 175 706 532 115 739 338 96 465 358 543
## [271] 695 661 502 580 397 404 230 148 667 556 350 425 631 423 202 81 558 503
## [289] 232 670 106 375 11 669 364 550 403 461 549 31 414 505 699 513 484 16
## [307] 197 420 678 417 412 551 12 437 609 66 50 204 579 435 741 559 384 122
## [325] 399 472 315 259 353 248 604 48 331 100 108 301 10 499 658 752 402 515
## [343] 442 653 600 395 744 8 114 261 29 306 659 679 282 73 267 738 262 733
## [361] 451 745 219 184 352 662 119 643 685 691 482 36 563 240 379 120 488 304
## [379] 432 723 449 105 281 180 547 448 241 548 167 47 191 37 174 599 303 207
## [397] 19 615 708 504 103 760 652 188 139 762 690 189 311 361 572 38 633 319
## [415] 376 334 416 546 393 371 436 21 199 87 728 497 464 520 536 517 622 367
## [433] 6 128 156 288 49 227 239 193 462 507 491 190 112 378 625 467 576 321
## [451] 59 305 61 540 525 736 750 88 132 612 251 203 246 641 460 700 366 131
## [469] 578 162 645 168 276 78 566 743 95 221 405 161 533 242 181 620 761 594
## [487] 273 187 171 313 582 136 79 638 521 400 446 427 345 646 138 719 583 686
## [505] 272 673 454 755 201 637 657 489 632 2 65 260 247 734 710 418 640 206
## [523] 124 596 7 228 45 514 25 753 470 672 268 501 263 169 711
Now, we create the train/test pair:
pm$diabetes <- factor(pm$diabetes)
pm_training <- pm[idx,]
pm_testing <- pm[-idx,]
print(paste("Training data has",nrow(pm_training),"rows"))
## [1] "Training data has 537 rows"
print(paste("Testing data has",nrow(pm_testing),"rows"))
## [1] "Testing data has 231 rows"
Let’s build a model using the Generalized Linear Models function. Note that there are other types of functions we could use, but most people have at least heard of Logistic Regression, so let’s start with that as it also provides us the opportunity to understand some basic ML concepts.
The glm function is part of Base R so you don’t need to load any additional packages to use it. We need to specify a formula, some training data, and a “family” argument.
myglm <- glm(diabetes ~ .,
data = pm_training,
family = "binomial")
summary(myglm)
##
## Call:
## glm(formula = diabetes ~ ., family = "binomial", data = pm_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2424 -0.7256 -0.4283 0.7341 2.9311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.405409 0.841872 -9.984 < 2e-16 ***
## pregnant 0.103471 0.037973 2.725 0.00643 **
## glucose 0.035730 0.004563 7.830 4.89e-15 ***
## pressure -0.012707 0.006057 -2.098 0.03590 *
## triceps 0.003563 0.008088 0.440 0.65959
## insulin -0.001710 0.001060 -1.613 0.10671
## mass 0.088735 0.017954 4.942 7.72e-07 ***
## pedigree 0.696250 0.334761 2.080 0.03754 *
## age 0.017015 0.011066 1.538 0.12415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 694.17 on 536 degrees of freedom
## Residual deviance: 509.76 on 528 degrees of freedom
## AIC: 527.76
##
## Number of Fisher Scoring iterations: 5
It’s helpful to know what this object contains. Many people do not bother to look but there is a lot of information contained therein:
str(myglm)
## List of 30
## $ coefficients : Named num [1:9] -8.40541 0.10347 0.03573 -0.01271 0.00356 ...
## ..- attr(*, "names")= chr [1:9] "(Intercept)" "pregnant" "glucose" "pressure" ...
## $ residuals : Named num [1:537] 2.82 -1.23 -4.17 -1.05 -1.11 ...
## ..- attr(*, "names")= chr [1:537] "415" "463" "179" "526" ...
## $ fitted.values : Named num [1:537] 0.3547 0.1858 0.7605 0.0438 0.1003 ...
## ..- attr(*, "names")= chr [1:537] "415" "463" "179" "526" ...
## $ effects : Named num [1:537] 4.67 -3.171 -8.547 -0.638 -1.41 ...
## ..- attr(*, "names")= chr [1:537] "(Intercept)" "pregnant" "glucose" "pressure" ...
## $ R : num [1:9, 1:9] -9.11 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:9] "(Intercept)" "pregnant" "glucose" "pressure" ...
## .. ..$ : chr [1:9] "(Intercept)" "pregnant" "glucose" "pressure" ...
## $ rank : int 9
## $ qr :List of 5
## ..$ qr : num [1:537, 1:9] -9.1125 0.0427 0.0468 0.0225 0.033 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:537] "415" "463" "179" "526" ...
## .. .. ..$ : chr [1:9] "(Intercept)" "pregnant" "glucose" "pressure" ...
## ..$ rank : int 9
## ..$ qraux: num [1:9] 1.05 1.05 1.03 1 1 ...
## ..$ pivot: int [1:9] 1 2 3 4 5 6 7 8 9
## ..$ tol : num 1e-11
## ..- attr(*, "class")= chr "qr"
## $ family :List of 12
## ..$ family : chr "binomial"
## ..$ link : chr "logit"
## ..$ linkfun :function (mu)
## ..$ linkinv :function (eta)
## ..$ variance :function (mu)
## ..$ dev.resids:function (y, mu, wt)
## ..$ aic :function (y, n, mu, wt, dev)
## ..$ mu.eta :function (eta)
## ..$ initialize: language { if (NCOL(y) == 1) { ...
## ..$ validmu :function (mu)
## ..$ valideta :function (eta)
## ..$ simulate :function (object, nsim)
## ..- attr(*, "class")= chr "family"
## $ linear.predictors: Named num [1:537] -0.599 -1.478 1.155 -3.084 -2.194 ...
## ..- attr(*, "names")= chr [1:537] "415" "463" "179" "526" ...
## $ deviance : num 510
## $ aic : num 528
## $ null.deviance : num 694
## $ iter : int 5
## $ weights : Named num [1:537] 0.2289 0.1513 0.1822 0.0419 0.0903 ...
## ..- attr(*, "names")= chr [1:537] "415" "463" "179" "526" ...
## $ prior.weights : Named num [1:537] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:537] "415" "463" "179" "526" ...
## $ df.residual : int 528
## $ df.null : int 536
## $ y : Named num [1:537] 1 0 0 0 0 0 1 0 1 1 ...
## ..- attr(*, "names")= chr [1:537] "415" "463" "179" "526" ...
## $ converged : logi TRUE
## $ boundary : logi FALSE
## $ model :'data.frame': 537 obs. of 9 variables:
## ..$ diabetes: Factor w/ 2 levels "neg","pos": 2 1 1 1 1 1 2 1 2 2 ...
## ..$ pregnant: int [1:537] 0 8 5 3 8 5 14 4 6 1 ...
## ..$ glucose : int [1:537] 138 74 143 87 85 78 100 197 119 189 ...
## ..$ pressure: int [1:537] 60 70 78 60 55 48 78 70 50 60 ...
## ..$ triceps : int [1:537] 35 40 0 18 20 0 25 39 22 23 ...
## ..$ insulin : int [1:537] 167 49 0 0 0 0 184 744 176 846 ...
## ..$ mass : num [1:537] 34.6 35.3 45 21.8 24.4 33.7 36.6 36.7 27.1 30.1 ...
## ..$ pedigree: num [1:537] 0.534 0.705 0.19 0.444 0.136 ...
## ..$ age : int [1:537] 21 39 47 21 42 25 46 31 33 59 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree + age
## .. .. ..- attr(*, "variables")= language list(diabetes, pregnant, glucose, pressure, triceps, insulin, mass, pedigree, age)
## .. .. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:9] "diabetes" "pregnant" "glucose" "pressure" ...
## .. .. .. .. ..$ : chr [1:8] "pregnant" "glucose" "pressure" "triceps" ...
## .. .. ..- attr(*, "term.labels")= chr [1:8] "pregnant" "glucose" "pressure" "triceps" ...
## .. .. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(diabetes, pregnant, glucose, pressure, triceps, insulin, mass, pedigree, age)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:9] "factor" "numeric" "numeric" "numeric" ...
## .. .. .. ..- attr(*, "names")= chr [1:9] "diabetes" "pregnant" "glucose" "pressure" ...
## $ call : language glm(formula = diabetes ~ ., family = "binomial", data = pm_training)
## $ formula :Class 'formula' language diabetes ~ .
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## $ terms :Classes 'terms', 'formula' language diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree + age
## .. ..- attr(*, "variables")= language list(diabetes, pregnant, glucose, pressure, triceps, insulin, mass, pedigree, age)
## .. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:9] "diabetes" "pregnant" "glucose" "pressure" ...
## .. .. .. ..$ : chr [1:8] "pregnant" "glucose" "pressure" "triceps" ...
## .. ..- attr(*, "term.labels")= chr [1:8] "pregnant" "glucose" "pressure" "triceps" ...
## .. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(diabetes, pregnant, glucose, pressure, triceps, insulin, mass, pedigree, age)
## .. ..- attr(*, "dataClasses")= Named chr [1:9] "factor" "numeric" "numeric" "numeric" ...
## .. .. ..- attr(*, "names")= chr [1:9] "diabetes" "pregnant" "glucose" "pressure" ...
## $ data :'data.frame': 537 obs. of 9 variables:
## ..$ pregnant: int [1:537] 0 8 5 3 8 5 14 4 6 1 ...
## ..$ glucose : int [1:537] 138 74 143 87 85 78 100 197 119 189 ...
## ..$ pressure: int [1:537] 60 70 78 60 55 48 78 70 50 60 ...
## ..$ triceps : int [1:537] 35 40 0 18 20 0 25 39 22 23 ...
## ..$ insulin : int [1:537] 167 49 0 0 0 0 184 744 176 846 ...
## ..$ mass : num [1:537] 34.6 35.3 45 21.8 24.4 33.7 36.6 36.7 27.1 30.1 ...
## ..$ pedigree: num [1:537] 0.534 0.705 0.19 0.444 0.136 ...
## ..$ age : int [1:537] 21 39 47 21 42 25 46 31 33 59 ...
## ..$ diabetes: Factor w/ 2 levels "neg","pos": 2 1 1 1 1 1 2 1 2 2 ...
## $ offset : NULL
## $ control :List of 3
## ..$ epsilon: num 1e-08
## ..$ maxit : num 25
## ..$ trace : logi FALSE
## $ method : chr "glm.fit"
## $ contrasts : NULL
## $ xlevels : Named list()
## - attr(*, "class")= chr [1:2] "glm" "lm"
We can access this information by treating it like a list although there are functions such as summary that try to report back the more essential information.
summary(myglm)
##
## Call:
## glm(formula = diabetes ~ ., family = "binomial", data = pm_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2424 -0.7256 -0.4283 0.7341 2.9311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.405409 0.841872 -9.984 < 2e-16 ***
## pregnant 0.103471 0.037973 2.725 0.00643 **
## glucose 0.035730 0.004563 7.830 4.89e-15 ***
## pressure -0.012707 0.006057 -2.098 0.03590 *
## triceps 0.003563 0.008088 0.440 0.65959
## insulin -0.001710 0.001060 -1.613 0.10671
## mass 0.088735 0.017954 4.942 7.72e-07 ***
## pedigree 0.696250 0.334761 2.080 0.03754 *
## age 0.017015 0.011066 1.538 0.12415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 694.17 on 536 degrees of freedom
## Residual deviance: 509.76 on 528 degrees of freedom
## AIC: 527.76
##
## Number of Fisher Scoring iterations: 5
#plot(myglm)
This will become more important soon, but for now just observe that some of the features (e.g. age, insulin, triceps) do not appear to be important or significant. We’ll ignore that for now - just like a lot of professional machine learning specialists do every day (to their own peril).
We could do some “feature filtering” or “engineering” to first transform or remove features to make the model diagnostics better but that’s outside of the scope of this lecture.
Let’s pretend that all is well and now use this new model to predict outcomes using the test data set. Remember that we are attempting to predict a binary outcome - in this case whether the person is positive for diabetes or negative.
What we get back from the prediction object is a vector of probabilities, for which we have to determine a threshold: above it we would say the observation is “positive” for diabetes, and below it, “negative”.
probs <- predict(myglm,
newdata = pm_testing,
type = "response")
probs[1:10]
## 1 3 4 9 15 17 18
## 0.72751772 0.77340088 0.04425949 0.71110216 0.61359933 0.37579888 0.18733076
## 22 27 28
## 0.30034815 0.73345470 0.04357433
With logistic regression we are dealing with a curve like the one below which is a sigmoid function. The idea is to take our probabilities, which range between 0 and 1, and then pick a threshold (aka “alpha”) over which we would classify that observation / person as being positive for diabetes.
myseq <- seq(-6,6,.1)
myfunc <- function(x) {1/(1+exp(-x))}
plot(myseq,myfunc(myseq),
type = "l",
main = "Standard Logistic Sigmoid Function",
ylab = "")
abline(h=0.5,lty=2)
abline(v=0,lty=2)
text(-0.5,0.55,"0.5")
grid()
The temptation is to select 0.5 as the threshold, such that if a returned probability exceeds 0.5 we classify the associated subject as “positive” for the disease. But this assumes the probabilities are distributed accordingly, which is frequently not the case - though that doesn’t stop people from using 0.5. Another way to view this is as follows. This graphic shows a “perfect” classifier which more or less matches the logit function above:
Note that this is a rare situation wherein there exists a clean separation between negative and positive cases. We could have something like this:
Or this:
One thing we should do here is to look at the distribution of the returned probabilities before making a decision about where to set the threshold. We can see clearly that selecting 0.5 in this case would not be appropriate.
boxplot(probs,
main="Probabilities from our GLM Model")
grid()
The median appears to be somewhere around 0.25, so we could use that for now although we are just guessing.
mypreds <- ifelse(probs > 0.25,"pos","neg")
mypreds <- factor(mypreds, levels = levels(pm_testing[["diabetes"]]))
mypreds[1:10]
## 1 3 4 9 15 17 18 22 27 28
## pos pos neg pos pos pos neg pos pos neg
## Levels: neg pos
Next, we would compare our predictions against the known outcomes which are stored in the test data frame:
# How does this compare to the truth ?
table(predicted = mypreds,
actual = pm_testing$diabetes)
## actual
## predicted neg pos
## neg 99 16
## pos 51 65
What we are doing is building a “Confusion Matrix” which can help us determine how effective our model is. From such a matrix table we can compute a number of “performance measures”, such as accuracy, precision, sensitivity, specificity and others, to help assess the quality of the model. In predictive modeling we are always interested in how well any given model will perform on “new” data.
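As a sketch of how a few of these measures fall out of the 2x2 table, using the counts from the table above and treating “neg” as the “positive” class (caret’s default, since “neg” sorts first):

```r
# Confusion matrix counts copied from the table above
# (rows = predicted, columns = actual)
confmat <- matrix(c(99, 16,    # predicted neg
                    51, 65),   # predicted pos
                  nrow = 2, byrow = TRUE,
                  dimnames = list(predicted = c("neg", "pos"),
                                  actual    = c("neg", "pos")))

accuracy    <- sum(diag(confmat)) / sum(confmat)              # correct / total
sensitivity <- confmat["neg", "neg"] / sum(confmat[, "neg"])  # recall of the "neg" class
specificity <- confmat["pos", "pos"] / sum(confmat[, "pos"])  # recall of the "pos" class

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
##    accuracy sensitivity specificity
##      0.7100      0.6600      0.8025
```

These hand-computed values match what caret::confusionMatrix reports below.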
There are functions that can help us compute a confusion matrix. Because the variable we are trying to predict (diabetes) is a two-level factor (“neg” or “pos”), our predictions need to be a comparable factor rather than a character string - which is why we converted them above.
caret::confusionMatrix(mypreds,pm_testing$diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 99 16
## pos 51 65
##
## Accuracy : 0.71
## 95% CI : (0.6468, 0.7676)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.02999
##
## Kappa : 0.4207
##
## Mcnemar's Test P-Value : 3.271e-05
##
## Sensitivity : 0.6600
## Specificity : 0.8025
## Pos Pred Value : 0.8609
## Neg Pred Value : 0.5603
## Prevalence : 0.6494
## Detection Rate : 0.4286
## Detection Prevalence : 0.4978
## Balanced Accuracy : 0.7312
##
## 'Positive' Class : neg
##
fourfoldplot(caret::confusionMatrix(mypreds,pm_testing$diabetes)$table)
Let’s try a higher threshold, say 0.55, and see how the resulting confusion matrix changes:
newpreds <- ifelse(probs > 0.55,"pos","neg")
newpreds <- factor(newpreds, levels = levels(pm_testing[["diabetes"]]))
newpreds[1:10]
## 1 3 4 9 15 17 18 22 27 28
## pos pos neg pos pos neg neg neg pos neg
## Levels: neg pos
table(predicted = newpreds,
actual = pm_testing$diabetes)
## actual
## predicted neg pos
## neg 141 42
## pos 9 39
caret::confusionMatrix(newpreds,pm_testing$diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 141 42
## pos 9 39
##
## Accuracy : 0.7792
## 95% CI : (0.7201, 0.831)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 1.269e-05
##
## Kappa : 0.4651
##
## Mcnemar's Test P-Value : 7.433e-06
##
## Sensitivity : 0.9400
## Specificity : 0.4815
## Pos Pred Value : 0.7705
## Neg Pred Value : 0.8125
## Prevalence : 0.6494
## Detection Rate : 0.6104
## Detection Prevalence : 0.7922
## Balanced Accuracy : 0.7107
##
## 'Positive' Class : neg
##
fourfoldplot(caret::confusionMatrix(newpreds,pm_testing$diabetes)$table)
This is helpful, although there are a number of measures from which to select a primary performance metric. Ideally, we would know in advance which performance metric we would use to effectively “judge” the quality of our model.
In medical tests, “sensitivity” and “specificity” are commonly used. Some applications use “Accuracy” (which isn’t good when there is large group imbalance).
The problem here is that all we have done is looked at the confusion matrix corresponding to one specific (and arbitrary) threshold value when what we need is to look at a number of confusion matrices corresponding to many different thresholds.
For example, we might get a better sensitivity level had we selected the mean of the returned probabilities as the threshold. This process could go on and on.
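One way to make that search systematic is to sweep a grid of thresholds and recompute the measures at each one. Here is a sketch using a small hypothetical helper (metrics_at is our own function, not from any package), again treating “neg” as the positive class:

```r
# Hypothetical helper: sensitivity/specificity at a given threshold,
# with "neg" treated as the positive class to match the caret output above
metrics_at <- function(alpha, probs, truth) {
  preds <- factor(ifelse(probs > alpha, "pos", "neg"),
                  levels = c("neg", "pos"))
  tab <- table(predicted = preds, actual = truth)
  c(threshold   = alpha,
    sensitivity = tab["neg", "neg"] / sum(tab[, "neg"]),
    specificity = tab["pos", "pos"] / sum(tab[, "pos"]))
}

# Sweep thresholds from 0.1 to 0.9 against the test-set truth, e.g.:
# t(sapply(seq(0.1, 0.9, by = 0.1),
#          metrics_at, probs = probs, truth = pm_testing$diabetes))
```

Scanning the resulting table shows sensitivity and specificity trading off against each other as the threshold moves - which is exactly the trade-off the ROC curve, discussed next, summarizes in one picture.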
One way to do this is to use something known as the ROC curve. Luckily, R has functions for this. That isn’t surprising, as it is a standard tool that has been in use for decades - long before the hype around AI and ML. The ROC curve gives us a “one stop shop” for estimating a good value of alpha.
We want to maximize the area under a given ROC curve as it winds up being an effective way to judge the differences between one method and another. So, if we wanted to compare the glm model against a Random Forest model, we could use the respective AUC (Area Under Curve) metric to help us. This isn’t the only way to do this but it’s reasonable for now.
pred <- ROCR::prediction(predictions = probs,
labels = pm_testing$diabetes)
perf <- ROCR::performance(pred,
"tpr",
"fpr")
myroc <- ROCR::performance(pred,measure="auc")
plot(perf,colorize=T,
print.cutoffs.at=seq(0,1,by=0.1),
lwd=3,las=1,main=paste("Cool Ranger ROC Curve with",round(myroc@y.values[[1]],3),"AUC"))
abline(a = 0, b = 1)
grid()
So what value of alpha corresponds to the best point on this curve, with its stated AUC of 0.85? We’ll have to dig into the performance object to get that, but it looks to be between 0.30 and 0.40. Note that this is somewhat academic, since knowing the AUC alone helps us decide if our model is any “good”. For completeness we could use another R function to nail this down:
proc <- pROC::roc(pm_testing$diabetes,probs)
## Setting levels: control = neg, case = pos
## Setting direction: controls < cases
round(pROC::coords(proc, "b", ret="t", transpose = FALSE),2)
## threshold
## 1 0.36
Generally speaking, here is a graphic which shows you some ROC curve shapes and how they relate to AUC.
We haven’t accomplished very much here, because we need to look at multiple versions of the data in case we sampled a number of outliers when creating our training data. Or maybe we excluded a large number of outliers from the training set so they wound up in the test data set, which means that the predictive power of our model isn’t as robust as it should be.
Our next steps should involve creating multiple versions of the training and test pairs (say 3), computing the optimal AUC for each, and then looking at how those values vary across versions. If the AUCs vary widely then maybe our model is overtraining. If they don’t vary widely, it could be that the model has high bias.
What if, when splitting the data into a training and testing pair, the test set wound up containing most of the outliers in the data (assuming there were any)? Our trained model would probably underperform on the test data, since it did not “see” most of the outliers.
Or, consider the situation wherein the training data had most of the outliers and the resulting model was influenced by this. So, when applied to the testing data, the predictive results are not that great.
A way to combat this is to split the data into K number of groups aka “folds” - say 3. Then set up a loop that uses K-1 folds to train a model while using the “hold out” fold as the test data set. This means that each fold is used as a test set once.
It also means that each fold also gets used as part of a trained model thus we can offset the impact of any outliers found in the data set. Note, we are not trying to perfectly model the data, we just want to get a realistic idea as to how it might perform on unseen data.
The method works by giving us multiple estimates of out-of-sample error, rather than a single estimate. Let’s assume K is equal to 3, so we will partition our data into 3 individual “folds” which are basically equal in size. Then we’ll create a loop that does the following:
Combines 2 of the folds into a training data set
Builds a model on the combined 2-folds data
Applies the model to holdout fold
Computes the AUC value and stores it
Each fold is simply a portion of the data. We’ll generate a list called “folds” that contains 3 elements, each holding 256 index values corresponding to rows in pm. The way we draw the sample ensures that each row shows up in exactly one fold.
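A minimal sketch of that fold construction, assuming K = 3 (pm has 768 rows):

```r
set.seed(123)  # for reproducibility
# Shuffle the 768 row indices, then deal them round-robin into 3 folds
folds <- split(sample(1:768), 1:3)

sapply(folds, length)           # each fold holds 256 row indices
length(unique(unlist(folds)))   # all 768 rows covered, each exactly once
```

Because 768 divides evenly by 3, the folds come out exactly equal in size; with a K that doesn’t divide the row count evenly, split would issue a warning and the folds would differ in size by one.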
We could write our own function to not only create some number of folds but to also build a model, make predictions and store the accuracy for a given set of folds. Remember, each fold gets to be a “test” data set as part of the process.
cross_fold <- function(numofolds = 4, df = pm) {
  # Function to do K-fold cross validation
  # Split the data into K folds (numofolds)
  folds <- split(sample(1:nrow(df)), 1:numofolds)
  # Set up some empty lists to stash results
  modl  <- list()   # Holds each of the K models
  predl <- list()   # Holds each of the K prediction vectors
  auc   <- list()   # Holds the AUC for each model
  # Create a formula that can be reused across iterations of the loop
  myform <- "diabetes ~ ."
  for (ii in 1:length(folds)) {
    # Build a model on everything except the iith fold
    modl[[ii]] <- glm(formula = myform,
                      data = df[-folds[[ii]],],
                      family = "binomial")
    # Predict on the holdout (iith) fold
    predl[[ii]] <- predict(modl[[ii]],
                           newdata = df[folds[[ii]],],
                           type = "response")
    # Compute and store the AUC for this iteration
    pred <- ROCR::prediction(predl[[ii]],
                             df[folds[[ii]],]$diabetes)
    roc <- ROCR::performance(pred, measure = "auc")
    auc[[ii]] <- roc@y.values[[1]]
  }
  return(unlist(auc))
}
Running this is now quite simple. By default, this function will loop some number of times corresponding to the number of folds. During each iteration it will:
use glm to build a model on the training folds
generate predictions on the holdout (test) fold
compute the underlying AUC associated with the prediction
store the AUC in a vector
At the end of the function, the vector containing the computed AUCs will be returned.
An advantage of this approach is that we can now look at a range of AUC values (K of them) instead of just one. This gives us a better idea about how the model might perform on future data. Here we will look at the AUC across 8 folds:
numofolds <- 8
boxplot(cross_fold(numofolds = numofolds),
        main=paste("Distribution of AUC for",numofolds,"folds"),
        horizontal = TRUE)
grid()
lattice::stripplot(cross_fold(8),
main="AUC values for K-Fold Validation",
type=c("g","p"),pch=19,cex=1.5)
We can even replicate this to approximate something known as repeated cross validation:
numofolds <- 8
numofrepl <- 20
main <- paste("AUC Across",numofolds,"folds","\n and",numofrepl,"replications")
reps <- replicate(numofrepl,cross_fold(numofolds = numofolds))
boxplot(reps,
        main=main,cex.axis=.8,
        xlab="Simulation Number",
        ylim=c(min(reps)-0.025,max(reps)+0.025),
        ylab="AUC")
grid()
So what have we done here? Let’s recap:
Then we might extend this by doing K-Fold cross validation at Step 2 to get a better idea as to how the model we built might apply to the test data. If we wanted to use another method, say random forests, we could but that would require us to understand the calling sequence for that method.
We could do all of the above in a more general manner by using something called tidymodels. There is also a package called caret which does similar things using base R. But since the author of caret, Max Kuhn, has now joined the tidymodels team it seems that caret is in maintenance mode only. In any case, to understand tidymodels one must first understand the tidyverse.
The tidyverse represents an extension built on top of the base functions found in the default installation of R. Using tidy commands is entirely optional; there is no need to use them except in pursuit of the concepts the tidyverse seeks to implement.
That sounds like circular logic but the idea is to pursue an approach that generalizes across multiple steps in data manipulation and modeling. Being able to easily reuse R objects (data, results, figures) is a big deal because it simplifies the construction and comparison of models which in turn facilitates reproducibility.
One would also like to be able to transfer knowledge acquired in using one package to other packages in the R environment. That said, in my view (and it’s only that), having a good knowledge of base R is important since you are likely to encounter “legacy” code which does not use things like the tidyverse or tidymodels.
Some introductory courses dive straight into the tidyverse and that is fine to an extent. It’s just that if you enter a work environment that has lots of older R code you will need to work to understand it. As this presentation assumes previous experience with R we will not be going over R basics.
The tidymodels package comprises a number of supporting packages and installing it is as simple as doing:
install.packages("tidymodels")
The tidymodels package relies on the tidyverse package which provides verbs that correspond to common data manipulation activities. The tidyverse package also provides the ggplot2 package which is a premier visualization tool. Back to the verbs. Here are some basic examples using the Pima Indians diabetes data frame (pm).
Let’s filter the data based on values assumed by one or more columns.
filter(pm, pregnant > 10 & age < 40)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 11 138 76 0 0 33.2 0.420 35 neg
## 2 12 151 70 40 271 41.8 0.742 38 pos
## 3 14 175 62 30 0 33.6 0.212 38 pos
## 4 11 85 74 0 0 30.1 0.300 35 neg
## 5 13 104 72 0 0 31.2 0.465 38 pos
## 6 13 153 88 37 140 40.6 1.174 39 neg
# Get all rows where diabetes is "pos" and mass is < 20
filter(pm, diabetes == "pos" & mass < 20)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 8 125 96 0 0 0 0.232 54 pos
## 2 10 115 0 0 0 0 0.261 30 pos
We can even do aggregation using tidyverse functions. What is the average glucose per diabetes group?
summarize(group_by(pm,diabetes),avg=mean(glucose))
## # A tibble: 2 × 2
## diabetes avg
## <fct> <dbl>
## 1 neg 110.
## 2 pos 141.
One of the coolest functions provided by the tidyverse is the pipe operator, which allows us to feed the output of one command into the input of another in a left-to-right fashion. This concept is not new; it was originally developed for use in the UNIX / Linux operating systems.
Base R uses composite functions in accordance with its heritage as a mathematical / statistical language. So structures like f(g(x)) or f(g(h(x))) are easily accommodated but prone to typing errors due to the need to match parentheses.
# The composite function approach so favored by mathematicians
x <- 3.14
log(cos(sin(tan(x))),base=10)
## [1] -5.508045e-07
Using the pipe structure this would look like the following which allows you to build the line a command at a time which makes it less error prone. You “read” the command from left to right where the output of one command becomes the input to another and so on.
x %>% tan() %>% sin() %>% cos() %>% log(base=10)
## [1] -5.508045e-07
# or even
3.14 %>% tan() %>% sin() %>% cos() %>% log(base=10)
## [1] -5.508045e-07
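As an aside, recent versions of base R (4.1 and later) include a native pipe operator, |>, which behaves similarly for examples like these:

```r
# The native base-R pipe (R >= 4.1) reads left to right just like %>%:
# the value on the left becomes the first argument of the call on the right.
3.14 |> tan() |> sin() |> cos() |> log(base = 10)
```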
# The pipe operator is %>%
pm %>% filter(diabetes == "pos" & mass < 20)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 8 125 96 0 0 0 0.232 54 pos
## 2 10 115 0 0 0 0 0.261 30 pos
We can conveniently sort data. In this case we filter rows where diabetes is positive and mass is less than 25, then arrange the result from highest glucose to lowest.
pm %>% filter(diabetes == "pos" & mass < 25) %>% arrange(desc(glucose))
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 194 78 0 0 23.5 0.129 59 pos
## 2 8 183 64 0 0 23.3 0.672 32 pos
## 3 1 167 74 17 144 23.4 0.447 33 pos
## 4 6 162 62 0 0 24.3 0.178 50 pos
## 5 9 156 86 0 0 24.8 0.230 53 pos
## 6 4 134 72 0 0 23.8 0.277 60 pos
## 7 8 125 96 0 0 0.0 0.232 54 pos
## 8 10 115 0 0 0 0.0 0.261 30 pos
## 9 3 107 62 13 48 22.9 0.678 23 pos
As applied to our grouping example above, we can de-couple the somewhat nested command into a pipeline:
pm %>% group_by(diabetes) %>% summarize(average_glucose=mean(glucose))
## # A tibble: 2 × 2
## diabetes average_glucose
## <fct> <dbl>
## 1 neg 110.
## 2 pos 141.
Note that the “%>%” operator allows us to route the output of one command to the input of another. Here is a simpler example:
pm %>% filter(mass > 50)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 0 162 76 56 100 53.2 0.759 25 pos
## 2 1 88 30 42 99 55.0 0.496 26 pos
## 3 0 129 110 46 130 67.1 0.319 26 pos
## 4 11 135 0 0 0 52.3 0.578 40 pos
## 5 0 165 90 33 680 52.3 0.427 23 neg
## 6 5 115 98 0 0 52.9 0.209 28 pos
## 7 0 180 78 63 14 59.4 2.420 25 pos
## 8 3 123 100 35 240 57.3 0.880 22 neg
# same as
filter(pm, mass > 50)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 0 162 76 56 100 53.2 0.759 25 pos
## 2 1 88 30 42 99 55.0 0.496 26 pos
## 3 0 129 110 46 130 67.1 0.319 26 pos
## 4 11 135 0 0 0 52.3 0.578 40 pos
## 5 0 165 90 33 680 52.3 0.427 23 neg
## 6 5 115 98 0 0 52.9 0.209 28 pos
## 7 0 180 78 63 14 59.4 2.420 25 pos
## 8 3 123 100 35 240 57.3 0.880 22 neg
Back to our aggregation pipeline, we can “pipe” the result into the ggplot function to get a visualization. ggplot requires us to provide it tables or data frames. Actually it prefers “tibbles” which is a type of data frame but that’s getting off track a little bit.
pm %>%
group_by(diabetes) %>%
summarize(average_glucose=mean(glucose)) %>%
ggplot(aes(x=diabetes,y=average_glucose)) +
geom_bar(stat="identity") +
ggtitle("Avg Glucose / Group")
If we wanted to improve upon this we could, perhaps by adding a cleaner theme and more informative labels:
pm %>%
group_by(diabetes) %>%
summarize(average_glucose=mean(glucose)) %>%
ggplot(aes(x=diabetes,y=average_glucose)) +
geom_bar(stat="identity") +
theme_light() +
labs(title = "Average Glucose / Group",
subtitle = "Data extracted from Pima Indians Data set",
caption = "Super Cool tidyverse example")
A distinct advantage of this approach is that we did not change the original data frame in any way. Check it out:
head(pm)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 0 0 25.6 0.201 30 neg
The piping allows for free-form experimentation without the need to save temporary and/or incremental versions of data frames, which is frequently the case when using Base R commands. Here is another example:
pm %>%
ggplot(aes(x=mass,y=glucose,color=diabetes)) +
geom_point() +
labs(
title = "Glucose Level vs Mass",
caption = "Data from the Pima Indians Data set.",
tag = "Figure 1",
x = "Body mass index: weight in kg / (height in m)^2",
y = "Plasma glucose concentration",
colour = "diabetes"
) + theme_bw()
The tidyverse (specifically the broom package) has a way to help clean up the results offered by Base R packages. Consider the following:
fit <- lm(glucose ~ mass + age,pm_training)
Look at the output which is not directly in a convenient format.
fit
##
## Call:
## lm(formula = glucose ~ mass + age, data = pm_training)
##
## Coefficients:
## (Intercept) mass age
## 68.6641 0.9534 0.6775
summary(fit)
##
## Call:
## lm(formula = glucose ~ mass + age, data = pm_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.625 -19.214 -1.093 18.559 83.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.6641 6.6155 10.379 < 2e-16 ***
## mass 0.9534 0.1663 5.732 1.66e-08 ***
## age 0.6775 0.1097 6.173 1.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.07 on 534 degrees of freedom
## Multiple R-squared: 0.1161, Adjusted R-squared: 0.1128
## F-statistic: 35.07 on 2 and 534 DF, p-value: 4.893e-15
Access coefficients only:
fit$coefficients
## (Intercept) mass age
## 68.6640503 0.9534197 0.6774946
But with a tidy approach you get things in a data frame which is inarguably the most flexible data type in the R language. The tidy function “produces a tibble() where each row contains information about an important component of the model. For regression models, this often corresponds to regression coefficients”
tidy(fit)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 68.7 6.62 10.4 4.09e-23
## 2 mass 0.953 0.166 5.73 1.66e- 8
## 3 age 0.677 0.110 6.17 1.32e- 9
The glance function “returns a tibble with exactly one row of goodness of fitness measures and related statistics. This is useful to check for model misspecification and to compare many models”
glance(fit)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.116 0.113 30.1 35.1 4.89e-15 2 -2588. 5184. 5201.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The augment function “adds columns to a dataset, containing information such as fitted values, residuals or cluster assignments. All columns added to a dataset have . prefix to prevent existing columns from being overwritten.”
augment(fit, pm_training)[1:10,]
## # A tibble: 10 × 16
## .rownames pregnant glucose pressure triceps insulin mass pedigree age
## <chr> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
## 1 415 0 138 60 35 167 34.6 0.534 21
## 2 463 8 74 70 40 49 35.3 0.705 39
## 3 179 5 143 78 0 0 45 0.19 47
## 4 526 3 87 60 18 0 21.8 0.444 21
## 5 195 8 85 55 20 0 24.4 0.136 42
## 6 118 5 78 48 0 0 33.7 0.654 25
## 7 299 14 100 78 25 184 36.6 0.412 46
## 8 229 4 197 70 39 744 36.7 2.33 31
## 9 244 6 119 50 22 176 27.1 1.32 33
## 10 14 1 189 60 23 846 30.1 0.398 59
## # … with 7 more variables: diabetes <fct>, .fitted <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
augment(fit, pm_training) %>%
ggplot(aes(x=glucose,y=.fitted,color=diabetes)) +
geom_point() + labs(title="Fitted vs Glucose") + theme_bw()
The tidymodels package can be useful even if you are just experimenting since it provides a standard and uniform approach by which to work. In Base R, each modeling function (lm, glm, rpart, svm) has arguments specific to that command, which contributes to confusion when switching between models for purposes of comparison. This is where tidymodels can be very helpful.
One of the benefits of the package is that it provides front ends for the many methods available in R, which includes decision trees and extensions thereof. An advantage of Base R when building models is that many methods will convert the target variable to a label / nominal feature without you asking. This is a convenience, though the tidyverse generally wants you to be explicit about this.
url <- "https://raw.githubusercontent.com/steviep42/bios534_spring_2020/master/data/pima.csv"
pm <- read.csv(url)
pm <- pm %>% mutate(diabetes=factor(diabetes))
skimr::skim(pm)
| Name | pm |
| Number of rows | 768 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| diabetes | 0 | 1 | FALSE | 2 | neg: 500, pos: 268 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| pregnant | 0 | 1 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| glucose | 0 | 1 | 120.89 | 31.97 | 0.00 | 99.00 | 117.00 | 140.25 | 199.00 | ▁▁▇▆▂ |
| pressure | 0 | 1 | 69.11 | 19.36 | 0.00 | 62.00 | 72.00 | 80.00 | 122.00 | ▁▁▇▇▁ |
| triceps | 0 | 1 | 20.54 | 15.95 | 0.00 | 0.00 | 23.00 | 32.00 | 99.00 | ▇▇▂▁▁ |
| insulin | 0 | 1 | 79.80 | 115.24 | 0.00 | 0.00 | 30.50 | 127.25 | 846.00 | ▇▁▁▁▁ |
| mass | 0 | 1 | 31.99 | 7.88 | 0.00 | 27.30 | 32.00 | 36.60 | 67.10 | ▁▃▇▂▁ |
| pedigree | 0 | 1 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| age | 0 | 1 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
We’ll repeat our glm example though this time using tidymodels. It is important to understand that the tidymodels package does not attempt to replace or rewrite common and popular R packages for model assembly.
The package provides a standard and uniform interface that “sits on top” of various package functions. The advantage to the user is that you do not necessarily need to know all of the individual arguments to a specific command in order to use it, though if you do then you can still leverage that information.
pm_tidy_glm <- logistic_reg(mode = "classification") %>%
set_engine("glm")
pm_tidy_glm
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
To get more info:
pm_tidy_glm %>% translate()
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
##
## Model fit template:
## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
## family = stats::binomial)
pm_tidy_glm_fit <- logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit(diabetes ~ ., data = pm_training)
What do we get back?
pm_tidy_glm_fit
## parsnip model object
##
## Fit time: 4ms
##
## Call: stats::glm(formula = diabetes ~ ., family = stats::binomial,
## data = data)
##
## Coefficients:
## (Intercept) pregnant glucose pressure triceps insulin
## -8.405409 0.103471 0.035730 -0.012707 0.003563 -0.001710
## mass pedigree age
## 0.088735 0.696250 0.017015
##
## Degrees of Freedom: 536 Total (i.e. Null); 528 Residual
## Null Deviance: 694.2
## Residual Deviance: 509.8 AIC: 527.8
In terms of making a prediction, we can do that easily. The predict function is what we call a generic function in R which understands how to do its work based on the type of object and data that it is being given. Remember that we built a model which returned an object. We can pass that to the predict function:
glance(pm_tidy_glm_fit)
## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 694. 536 -255. 528. 566. 510. 528 537
# Use the glm model to make a prediction
test_preds <- predict(pm_tidy_glm_fit,pm_testing)
test_preds[1:10,]
## # A tibble: 10 × 1
## .pred_class
## <fct>
## 1 pos
## 2 pos
## 3 neg
## 4 pos
## 5 pos
## 6 neg
## 7 neg
## 8 neg
## 9 pos
## 10 neg
Note that we also have access to probabilities which could be useful if we needed to generate a ROC curve or wanted to take finer grained control over thresholds.
predict(pm_tidy_glm_fit,
pm_testing,type="prob")[1:10,]
## # A tibble: 10 × 2
## .pred_neg .pred_pos
## <dbl> <dbl>
## 1 0.272 0.728
## 2 0.227 0.773
## 3 0.956 0.0443
## 4 0.289 0.711
## 5 0.386 0.614
## 6 0.624 0.376
## 7 0.813 0.187
## 8 0.700 0.300
## 9 0.267 0.733
## 10 0.956 0.0436
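With probabilities in hand we could also apply a custom classification threshold rather than the default 0.5. This is a sketch, not part of the original workflow; it assumes the pm_tidy_glm_fit and pm_testing objects from above, and the 0.35 cutoff is an arbitrary illustration.

```r
library(dplyr)

# Sketch: classify as "pos" whenever the predicted probability of the
# positive class exceeds a custom threshold (0.35 here, chosen arbitrarily).
probs <- predict(pm_tidy_glm_fit, pm_testing, type = "prob")
custom <- probs %>%
  mutate(.pred_class = factor(ifelse(.pred_pos > 0.35, "pos", "neg"),
                              levels = c("neg", "pos")))
head(custom)
```

Lowering the threshold like this trades specificity for sensitivity, which can be useful when missing a positive case is costlier than a false alarm.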
By default we get back some predictions as to whether a subject is positive or negative for diabetes which we could then compare to the actual values. This would allow us to create a confusion matrix that would then permit the computation of various performance measures such as accuracy, sensitivity, specificity, etc.
In the tidymodels world, there are functions that will give this information to us without the need for the confusion matrix unless we wanted to print it.
The following shows us the predictions as part of the data frame. The tidyverse very much likes for things to be in a data frame format - actually something called a tibble which is a data frame optimized for use with the tidyverse.
pm_tidy_glm_fit %>%
predict(pm_testing) %>%
bind_cols(pm_testing) %>%
glimpse()
## Rows: 231
## Columns: 10
## $ .pred_class <fct> pos, pos, neg, pos, pos, neg, neg, neg, pos, neg, pos, neg…
## $ pregnant <int> 6, 8, 1, 2, 5, 0, 7, 8, 7, 1, 3, 10, 7, 7, 9, 0, 5, 1, 0, …
## $ glucose <int> 148, 183, 89, 197, 166, 118, 107, 99, 147, 97, 158, 122, 1…
## $ pressure <int> 72, 64, 66, 70, 72, 84, 74, 84, 76, 66, 76, 78, 84, 92, 11…
## $ triceps <int> 35, 0, 23, 45, 19, 47, 0, 0, 0, 15, 36, 31, 0, 18, 24, 39,…
## $ insulin <int> 0, 0, 94, 543, 175, 230, 0, 0, 0, 140, 245, 0, 0, 0, 240, …
## $ mass <dbl> 33.6, 23.3, 28.1, 30.5, 25.8, 45.8, 29.6, 35.4, 39.4, 23.2…
## $ pedigree <dbl> 0.627, 0.672, 0.167, 0.158, 0.587, 0.551, 0.254, 0.388, 0.…
## $ age <int> 50, 32, 21, 53, 51, 31, 31, 50, 43, 22, 28, 45, 37, 48, 54…
## $ diabetes <fct> pos, pos, neg, pos, pos, pos, pos, neg, pos, neg, pos, neg…
This sets the stage for evaluating the accuracy of the predictions resulting from the model:
my_metrics <- metric_set(accuracy,sensitivity,specificity, precision)
pm_tidy_glm_fit %>%
predict(pm_testing) %>%
bind_cols(pm_testing) %>%
my_metrics(truth=diabetes, estimate=.pred_class,event_level = "second")
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.792
## 2 sens binary 0.556
## 3 spec binary 0.92
## 4 precision binary 0.789
So we could derive all the metrics directly from the confusion matrix - if we wanted to
pm_tidy_glm_fit %>%
predict(pm_testing) %>%
bind_cols(pm_testing) %>%
conf_mat(truth=diabetes, estimate=.pred_class)
## Truth
## Prediction neg pos
## neg 138 36
## pos 12 45
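As a sanity check, the metrics reported earlier can be recomputed by hand from the four counts in this confusion matrix, treating "pos" as the event of interest:

```r
# Hand-computed metrics from the confusion matrix above,
# with "pos" as the event level:
TP <- 45   # predicted pos, truth pos
TN <- 138  # predicted neg, truth neg
FP <- 12   # predicted pos, truth neg
FN <- 36   # predicted neg, truth pos

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 183/231 ~ 0.792
sensitivity <- TP / (TP + FN)                   # 45/81   ~ 0.556
specificity <- TN / (TN + FP)                   # 138/150 = 0.92
precision   <- TP / (TP + FP)                   # 45/57   ~ 0.789
round(c(accuracy, sensitivity, specificity, precision), 3)
```

These agree with the accuracy, sens, spec, and precision values produced by the metric set above.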
But we don’t need to do this as tidymodels has its own way of getting multiple metrics all at once or a subset thereof:
pm_tidy_glm_fit %>%
predict(pm_testing) %>%
bind_cols(pm_testing) %>%
conf_mat(truth=diabetes, estimate=.pred_class) %>%
summary()
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.792
## 2 kap binary 0.510
## 3 sens binary 0.92
## 4 spec binary 0.556
## 5 ppv binary 0.793
## 6 npv binary 0.789
## 7 mcc binary 0.526
## 8 j_index binary 0.476
## 9 bal_accuracy binary 0.738
## 10 detection_prevalence binary 0.753
## 11 precision binary 0.793
## 12 recall binary 0.92
## 13 f_meas binary 0.852
Or you can specify just the ones you want without having to even generate the confusion matrix
# Note that sensitivity is also known as recall
my_metrics <- metric_set(accuracy, sensitivity, specificity, precision)
pm_tidy_glm_fit %>%
predict(pm_testing) %>%
bind_cols(pm_testing) %>%
my_metrics(truth=diabetes, estimate=.pred_class, event_level = "second")
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.792
## 2 sens binary 0.556
## 3 spec binary 0.92
## 4 precision binary 0.789
And what about drawing ROC curves? The tidymodels package has that covered. We just have to remember to get the probabilities associated with the prediction instead of the labels.
pm_tidy_glm_fit %>%
predict(pm_testing,type="prob") %>% # set type="prob"
bind_cols(pm_testing) %>%
roc_curve(truth=diabetes,estimate=.pred_pos,event_level="second") %>%
autoplot()
We can improve upon the default plot. First, we’ll get the auc value to use in our labels. Then we’ll use this information along with various ggplot functions.
# Get the AUC
pm_tidy_glm_roc_auc <- pm_tidy_glm_fit %>%
predict(pm_testing,type="prob") %>%
bind_cols(pm_testing) %>%
roc_auc(truth=diabetes,estimate=.pred_pos,event_level="second")
# Put up an annotated plot
pm_tidy_glm_fit %>%
predict(pm_testing,type="prob") %>%
bind_cols(pm_testing) %>%
roc_curve(truth=diabetes,estimate=.pred_pos,event_level="second") %>%
ggplot(aes(x=1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty=3) +
coord_equal() +
labs(title=paste("ROC Curve for glm with",round(pm_tidy_glm_roc_auc$.estimate,3),"AUC")) +
theme_bw()
Let’s take a step back into the definition of the model to learn a little more about what is happening at a basic level. We wanted to do logistic regression, which can be done as follows:
logistic_reg() %>% translate()
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
##
## Model fit template:
## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
## family = stats::binomial)
By default, the above function will select a computational engine of “glm” although we can choose from a number of engines which correspond to different underlying regression methods including the following:
show_engines("logistic_reg")
## # A tibble: 6 × 2
## engine mode
## <chr> <chr>
## 1 glm classification
## 2 glmnet classification
## 3 LiblineaR classification
## 4 spark classification
## 5 keras classification
## 6 stan classification
We can choose a different underlying method if we so desire:
logistic_reg(mode = "classification",penalty=1) %>%
set_engine("glmnet") %>%
translate()
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 1
##
## Computational engine: glmnet
##
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
## family = "binomial")
This could be economized to:
logistic_reg(mode = "classification",engine="glmnet",penalty=1) %>%
translate()
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 1
##
## Computational engine: glmnet
##
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
## family = "binomial")
Jumping over to another method is easy. Let’s build a random forest using the randomForest package. While we provide some arguments specific to the randomForest function, there is still a lot of similarity to our previous approach. Note that we do not have to separate the creation of the model from the prediction phase (unless we want to).
# We can get multiple metrics at once.
my_metrics <- metric_set(accuracy, sensitivity, specificity, precision)
# Fit, predict, and evaluate in one pipeline
pm_tidy_preds_rf <- rand_forest(trees = 100, mode = "classification") %>%
set_engine("randomForest") %>%
fit(diabetes ~ ., data = pm_training) %>%
predict(pm_testing) %>%
bind_cols(pm_testing) %>%
my_metrics(truth=diabetes, estimate=.pred_class, event_level = "second")
pm_tidy_preds_rf
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.745
## 2 sens binary 0.543
## 3 spec binary 0.853
## 4 precision binary 0.667
So another nice thing about the tidymodels package is that it provides a way to create and manage the training and test pair. We implemented our own code to do this in the earlier example. While that is okay to do, the package provides ways to do this via the initial_split function.
set.seed(123)
pm_split <- initial_split(pm,prop=0.7)
pm_split
## <Analysis/Assess/Total>
## <537/231/768>
This gives us a self-contained object with all the information necessary to get the training and test data on demand. The following will access the training data
training(pm_split)[1:10,]
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 415 0 138 60 35 167 34.6 0.534 21 pos
## 463 8 74 70 40 49 35.3 0.705 39 neg
## 179 5 143 78 0 0 45.0 0.190 47 neg
## 526 3 87 60 18 0 21.8 0.444 21 neg
## 195 8 85 55 20 0 24.4 0.136 42 neg
## 118 5 78 48 0 0 33.7 0.654 25 neg
## 299 14 100 78 25 184 36.6 0.412 46 pos
## 229 4 197 70 39 744 36.7 2.329 31 neg
## 244 6 119 50 22 176 27.1 1.318 33 pos
## 14 1 189 60 23 846 30.1 0.398 59 pos
The following will get the test data
testing(pm_split)[1:10,]
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 9 2 197 70 45 543 30.5 0.158 53 pos
## 15 5 166 72 19 175 25.8 0.587 51 pos
## 17 0 118 84 47 230 45.8 0.551 31 pos
## 18 7 107 74 0 0 29.6 0.254 31 pos
## 22 8 99 84 0 0 35.4 0.388 50 neg
## 27 7 147 76 0 0 39.4 0.257 43 pos
## 28 1 97 66 15 140 23.2 0.487 22 neg
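As an aside, initial_split can also stratify the split on the outcome via its strata argument, so that both partitions preserve the neg/pos proportions. This is a side illustration, not used in the rest of the notes:

```r
# Stratified 70/30 split: both partitions keep roughly the same
# neg/pos class balance as the full pm data frame.
set.seed(123)
pm_split_strat <- initial_split(pm, prop = 0.7, strata = diabetes)

prop.table(table(training(pm_split_strat)$diabetes))
prop.table(table(testing(pm_split_strat)$diabetes))
```

Stratification matters most when classes are imbalanced, since a plain random split can otherwise leave the test set with too few positives.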
Let’s repeat our glm model on the new training data. We don’t expect much difference. This is just illustrating how we can take advantage of tidymodels functions to help us.
# Get the Training data
pm_training <- training(pm_split)
# Define a model
pm_tidy_glm_fit <- logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit(diabetes ~ ., data = pm_training)
Now let’s do some predictions using the test data
# Get the Testing data
pm_testing <- testing(pm_split)
# Do predictions on the testing data
pm_tidy_glm_fit %>%
predict(pm_testing) %>%
bind_cols(pm_testing) %>%
accuracy(truth = diabetes, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.792
Just as with Base R, we can setup our data to produce a set of K folds to help us better estimate model performance on unseen data.
We could write our own code to do this but we do not need to, as tidymodels has the rsample package to help us. Here we can get 6 folds from the training data:
pm_folds <- vfold_cv(pm_training, v = 6)
So what does this structure look like?
glimpse(pm_folds)
## Rows: 6
## Columns: 2
## $ splits <list> [<vfold_split[447 x 90 x 537 x 9]>], [<vfold_split[447 x 90 x …
## $ id <chr> "Fold1", "Fold2", "Fold3", "Fold4", "Fold5", "Fold6"
pm_folds$splits[[1]]
## <Analysis/Assess/Total>
## <447/90/537>
pm_folds$splits[[1]][["data"]][1:5,]
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 415 0 138 60 35 167 34.6 0.534 21 pos
## 463 8 74 70 40 49 35.3 0.705 39 neg
## 179 5 143 78 0 0 45.0 0.190 47 neg
## 526 3 87 60 18 0 21.8 0.444 21 neg
## 195 8 85 55 20 0 24.4 0.136 42 neg
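Rather than digging into the list structure by hand, rsample provides accessor functions for unpacking an individual fold: analysis() returns the in-fold rows used for fitting and assessment() the held-out rows.

```r
# Each element of pm_folds$splits is an rsplit object.
fold1 <- pm_folds$splits[[1]]

dim(analysis(fold1))    # the 447 rows used for fitting this fold
dim(assessment(fold1))  # the 90 held-out rows for this fold
```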
Let’s repeat our exercise with fitting a model using the folds. While we are at it, we will collect the mean accuracy and roc_auc for this experiment.
logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit_resamples(diabetes ~ .,pm_folds) %>%
collect_metrics() %>%
select(.metric,mean,n)
## # A tibble: 2 × 3
## .metric mean n
## <chr> <dbl> <int>
## 1 accuracy 0.767 6
## 2 roc_auc 0.823 6
Workflows allow us to consolidate a number of steps into a definition. From the manual:
A workflow is an object that can bundle together your pre-processing, modeling, and post-processing requests. For example, if you have a recipe and parsnip model, these can be combined into a workflow. The advantages are:
You don’t have to keep track of separate objects in your workspace.
The recipe prepping and model fitting can be executed using a single call to fit().
Let's define a model as before
pm_tidy_glm <- logistic_reg(mode = "classification") %>%
set_engine("glm")
Next we will create a workflow to contain it. We can also add in other things than just models but let’s keep it simple for now.
pm_workflow <- workflow() %>%
add_formula(diabetes~.) %>%
add_model(pm_tidy_glm)
Check it out:
pm_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## diabetes ~ .
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
Before we do anything with it, I wanted to point out that you can update specific components of an existing workflow. This is helpful if our workflow definition has lots of individual components. In this case we do not have that many, but this is a good time to point out the advantage of being able to update a workflow.
update_formula(pm_workflow,diabetes~mass)
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## diabetes ~ mass
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
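Just as update_formula swaps out the preprocessor, the workflows package also provides update_model for swapping in a different model specification. As a sketch, we could reuse the random forest specification from earlier without rebuilding the workflow from scratch:

```r
# Define a random forest spec (as earlier in the notes) and swap it
# into the existing workflow in place of the logistic regression.
pm_tidy_rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

pm_rf_workflow <- update_model(pm_workflow, pm_tidy_rf)
pm_rf_workflow
```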
So we can execute the workflow by “fitting” it.
pm_workflow_fit <- fit(pm_workflow,data=pm_training)
And then predictions can be made:
predict(pm_workflow_fit,pm_testing) %>%
bind_cols(new_data=pm_testing) %>%
conf_mat(truth=diabetes, estimate=.pred_class) %>%
summary()
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.792
## 2 kap binary 0.510
## 3 sens binary 0.92
## 4 spec binary 0.556
## 5 ppv binary 0.793
## 6 npv binary 0.789
## 7 mcc binary 0.526
## 8 j_index binary 0.476
## 9 bal_accuracy binary 0.738
## 10 detection_prevalence binary 0.753
## 11 precision binary 0.793
## 12 recall binary 0.92
## 13 f_meas binary 0.852
This could be done in one go if desired. Those facile with tidy commands might prefer this although you are encouraged to keep things simple if you wish. You can break up the pipeline into segments wherein you save information along the way.
workflow() %>%
add_formula(diabetes~.) %>%
add_model(pm_tidy_glm) %>%
fit(data=pm_training) %>%
predict(new_data=pm_testing) %>%
bind_cols(pm_testing) %>%
conf_mat(truth=diabetes, estimate=.pred_class) %>%
summary()
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.792
## 2 kap binary 0.510
## 3 sens binary 0.92
## 4 spec binary 0.556
## 5 ppv binary 0.793
## 6 npv binary 0.789
## 7 mcc binary 0.526
## 8 j_index binary 0.476
## 9 bal_accuracy binary 0.738
## 10 detection_prevalence binary 0.753
## 11 precision binary 0.793
## 12 recall binary 0.92
## 13 f_meas binary 0.852
What else can we do with workflows? What about cross validation? Yes! Let's define everything from scratch, which should serve as a review.
# Define the model
pm_tidy_glm <- logistic_reg(mode = "classification") %>%
set_engine("glm")
# Define the workflow
pm_workflow <- workflow() %>%
add_formula(diabetes~.) %>%
add_model(pm_tidy_glm)
So if we wanted to train the model on a set of folds how would we do that? Where would we specify it in the workflow? First, we can create the folds using the following:
# Create 6 folds
pm_folds <- vfold_cv(training(pm_split), v = 6)
glimpse(pm_folds)
## Rows: 6
## Columns: 2
## $ splits <list> [<vfold_split[447 x 90 x 537 x 9]>], [<vfold_split[447 x 90 x …
## $ id <chr> "Fold1", "Fold2", "Fold3", "Fold4", "Fold5", "Fold6"
Back to the question: how do we get these folds used in the training / fitting process?
# We can use the fit_resamples function to do a fit that uses the folds
log_fit <- fit_resamples(pm_workflow,
pm_folds,
metrics = my_metrics,
control = control_resamples(event_level = "second")) %>%
collect_metrics()
What do we get for our trouble ?
log_fit
## # A tibble: 4 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.763 6 0.0199 Preprocessor1_Model1
## 2 precision binary 0.707 6 0.0261 Preprocessor1_Model1
## 3 sens binary 0.559 6 0.0319 Preprocessor1_Model1
## 4 spec binary 0.877 6 0.0114 Preprocessor1_Model1
We see the average metric values across all folds which in this case are 6 in number. Keep in mind that we are simply trying to get an idea about how a trained model might perform on unseen data so if the reported values aren’t impressive that does not mean the model lacks value.
What else can a workflow contain? What about including some steps that transform data, such as removing highly correlated variables? Or maybe scaling input data, since some ML methods expect (or require) that prior to use. We might also want to do one-hot encoding on categorical data to better represent the different levels of a factor.
The way this works is that we define a recipe and then sequentially add operations to it.
There are two different types of operations that can be sequentially added to a recipe. Steps can include common operations like scaling a variable, creating dummy variables or interactions and so on. More computationally complex actions such as dimension reduction or imputation can also be specified.
Checks are operations that conduct specific tests of the data. When the test is satisfied, the data are returned without issue or modification. Otherwise, any error is thrown.
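As a small illustration of mixing the two kinds of operations (a sketch using the built-in mtcars data rather than our diabetes data):

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  check_missing(all_predictors()) %>%        # check: errors if any predictor has NAs
  step_normalize(all_numeric_predictors())   # step: center and scale numeric predictors
```

The check guards the pipeline (bad data fails loudly), while the step transforms it.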
Here we define a recipe that indicates what we are predicting. We could define the feature types more granularly, which might mean designating some variables as factors.
recipe(diabetes~.,data=training(pm_split)) %>% summary()
## # A tibble: 9 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 pregnant numeric predictor original
## 2 glucose numeric predictor original
## 3 pressure numeric predictor original
## 4 triceps numeric predictor original
## 5 insulin numeric predictor original
## 6 mass numeric predictor original
## 7 pedigree numeric predictor original
## 8 age numeric predictor original
## 9 diabetes nominal outcome original
So now, let's create a recipe that centers and scales all variables except the outcome variable. There are different ways to do this, but here is one way. Note that we "prep" the recipe by calling the prep function. Check the reference for recipes at https://recipes.tidymodels.org/reference/index.html; many processing options are possible.
pm_recipe_prepped <- recipe(diabetes ~ .,data=training(pm_split)) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
prep()
pm_recipe_prepped
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 8
##
## Training data contained 537 data points and no missing data.
##
## Operations:
##
## Centering for pregnant, glucose, pressure, triceps, ... [trained]
## Scaling for pregnant, glucose, pressure, triceps, ... [trained]
But the recipe itself is just that - a document of intention for something to be prepped (which we just did) and later "baked" into usable data.
pm_training_baked <- pm_recipe_prepped %>%
bake(training(pm_split))
glimpse(pm_training_baked)
## Rows: 537
## Columns: 9
## $ pregnant <dbl> -1.12949397, 1.23517762, 0.34842577, -0.24274213, 1.23517762,…
## $ glucose <dbl> 0.51182436, -1.49323429, 0.66846956, -1.08595675, -1.14861483…
## $ pressure <dbl> -0.44669796, 0.07161838, 0.48627146, -0.44669796, -0.70585613…
## $ triceps <dbl> 0.87650581, 1.18735221, -1.29941900, -0.18037195, -0.05603339…
## $ insulin <dbl> 0.7418514, -0.2693811, -0.6892998, -0.6892998, -0.6892998, -0…
## $ mass <dbl> 0.32844867, 0.41809712, 1.66036859, -1.31083739, -0.97785741,…
## $ pedigree <dbl> 0.1571759, 0.6528145, -0.8398981, -0.1036865, -0.9964156, 0.5…
## $ age <dbl> -1.02629240, 0.49473525, 1.17074754, -1.02629240, 0.74823986,…
## $ diabetes <fct> pos, neg, neg, neg, neg, neg, pos, neg, pos, pos, neg, pos, n…
Similarly, if we want the testing data to have the same pre-processing, we need to bake it using the prepped recipe. Note that the centering and scaling parameters are determined by the training data, which avoids data leakage when the model is later evaluated on the test set.
pm_testing_baked <- pm_recipe_prepped %>%
bake(testing(pm_split))
pm_testing_baked
## # A tibble: 231 × 9
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0.644 0.825 0.175 0.877 -0.689 0.200 0.427 1.42 pos
## 2 1.24 1.92 -0.239 -1.30 -0.689 -1.12 0.557 -0.0968 pos
## 3 -0.834 -1.02 -0.136 0.130 0.116 -0.504 -0.907 -1.03 neg
## 4 -0.538 2.36 0.0716 1.50 3.96 -0.197 -0.933 1.68 pos
## 5 0.348 1.39 0.175 -0.118 0.810 -0.799 0.311 1.51 pos
## 6 -1.13 -0.115 0.797 1.62 1.28 1.76 0.206 -0.181 pos
## 7 0.940 -0.459 0.279 -1.30 -0.689 -0.312 -0.654 -0.181 pos
## 8 1.24 -0.710 0.797 -1.30 -0.689 0.431 -0.266 1.42 neg
## 9 0.940 0.794 0.383 -1.30 -0.689 0.943 -0.646 0.833 pos
## 10 -0.834 -0.773 -0.136 -0.367 0.510 -1.13 0.0209 -0.942 neg
## # … with 221 more rows
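As a quick sanity check (a sketch using the baked objects above), the baked training predictors should have mean roughly 0 and standard deviation roughly 1, while the baked test columns generally will not, precisely because they were scaled with the training means and standard deviations:

```r
library(dplyr)

# Training predictors: means should be ~0 after centering on themselves
pm_training_baked %>%
  summarise(across(where(is.numeric), ~ round(mean(.x), 3)))

# Test predictors: means will differ slightly from 0
pm_testing_baked %>%
  summarise(across(where(is.numeric), ~ round(mean(.x), 3)))
```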
We can consolidate the creation of a model along with the recipe used to pre-process data into a workflow. To review, we will redefine the recipe, the model, and then the workflow itself.
First, here is our recipe. We are just scratching the surface here; we could also remove highly correlated variables and/or reduce the dimensionality of the data via PCA using step functions. For purposes of explanation, we will keep it simple and just do the centering and scaling.
pm_recipe <- training(pm_split) %>%
recipe(diabetes ~ .) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
prep()
Here is the model definition:
pm_tidy_glm <- logistic_reg(mode = "classification") %>%
set_engine("glm")
Next we will create a workflow to contain it. We can also add other things besides models, but let's keep it simple for now.
pm_workflow <- workflow() %>%
add_recipe(pm_recipe) %>%
add_model(pm_tidy_glm)
Then we fit the workflow, which brings it all together:
pm_workflow_fit <- fit(pm_workflow,data=pm_training)
Make predictions:
predict(pm_workflow_fit,pm_testing) %>%
bind_cols(pm_testing) %>%
my_metrics(truth=diabetes, estimate=.pred_class,event_level = "second")
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.792
## 2 sens binary 0.556
## 3 spec binary 0.92
## 4 precision binary 0.789
We could actually substitute a new model into the existing workflow. Let's define a new model, which in this case uses the randomForest method.
pm_tidy_rf <- rand_forest(mtry = 5, trees = 50) %>%
set_mode("classification") %>%
set_engine("randomForest")
Now we can update the workflow with the new model and complete the fit and predict steps in one go. This example is a little small, but imagine if our workflow had a lengthy recipe and model definition; we wouldn't have to redefine it, we would just update the existing one.
pm_workflow %>%
update_model(pm_tidy_rf) %>%
fit(data=pm_training) %>%
predict(new_data=pm_testing) %>%
bind_cols(pm_testing) %>%
my_metrics(truth=diabetes, estimate=.pred_class,event_level = "second")
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.771
## 2 sens binary 0.556
## 3 spec binary 0.887
## 4 precision binary 0.726
Many times the models we wish to create have a number of parameters that can be set to influence the model. In general, the fewer of these we have to consider, the easier it is to build the model, although if we have a good understanding of a given algorithm we can achieve better results by setting them. Consider, for example, that random forest methods involve building a number of trees to offset the overfitting that is possible with any single tree.
This means that the number of desired trees can be set before the method is run. In fact, the fit can be repeated using different numbers of trees to see if an optimal result can be achieved.
We call this (hyper)parameter tuning, which is something that tidymodels will do for us. In effect, we are optimizing a designated performance metric (e.g. accuracy, precision, recall) over the parameter values. Random forests have other hyperparameters that could also be set, such as the minimum number of cases that must reside in a node. Hyperparameters can be tuned together or separately.
In the example below we will create a model specification that indicates our desired method (random forest) along with the hyperparameters we wish to tune as part of our fit process.
tune_spec <-
rand_forest(mode = "classification",
mtry = tune(),
trees = tune()) %>%
set_engine("randomForest")
man_tune <- pm_workflow %>%
update_model(tune_spec) %>%
tune_grid(resamples = pm_folds,
grid = expand.grid(
mtry = c(2,4,6),
trees = c(20,40,60,80)
))
man_tune %>% collect_metrics()
## # A tibble: 24 × 8
## mtry trees .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 20 accuracy binary 0.741 6 0.0177 Preprocessor1_Model01
## 2 2 20 roc_auc binary 0.783 6 0.0233 Preprocessor1_Model01
## 3 4 20 accuracy binary 0.739 6 0.0209 Preprocessor1_Model02
## 4 4 20 roc_auc binary 0.786 6 0.0253 Preprocessor1_Model02
## 5 6 20 accuracy binary 0.735 6 0.0211 Preprocessor1_Model03
## 6 6 20 roc_auc binary 0.777 6 0.0282 Preprocessor1_Model03
## 7 2 40 accuracy binary 0.750 6 0.0155 Preprocessor1_Model04
## 8 2 40 roc_auc binary 0.792 6 0.0212 Preprocessor1_Model04
## 9 4 40 accuracy binary 0.752 6 0.0183 Preprocessor1_Model05
## 10 4 40 roc_auc binary 0.794 6 0.0261 Preprocessor1_Model05
## # … with 14 more rows
man_tune %>%
collect_metrics() %>%
mutate(trees = factor(trees)) %>%
ggplot(aes(mtry, mean, color = trees)) +
geom_line(size = 1.5, alpha = 0.6) +
geom_point(size = 2) +
facet_wrap(~ .metric, scales = "free", nrow = 2) +
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option = "plasma", begin = .9, end = 0) +
theme_bw()
man_tune %>% show_best()
## Warning: No value of `metric` was given; metric 'roc_auc' will be used.
## # A tibble: 5 × 8
## mtry trees .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 80 roc_auc binary 0.808 6 0.0247 Preprocessor1_Model10
## 2 2 60 roc_auc binary 0.801 6 0.0231 Preprocessor1_Model07
## 3 4 80 roc_auc binary 0.797 6 0.0241 Preprocessor1_Model11
## 4 4 60 roc_auc binary 0.797 6 0.0239 Preprocessor1_Model08
## 5 6 40 roc_auc binary 0.795 6 0.0237 Preprocessor1_Model06
man_final <- finalize_workflow(pm_workflow,
select_best(man_tune,metric="roc_auc")) %>%
fit(pm_training_baked)
man_final
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_center()
## • step_scale()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) pregnant glucose pressure triceps insulin
## -0.85736 0.35006 1.14048 -0.24517 0.05731 -0.19957
## mass pedigree age
## 0.69287 0.24021 0.20136
##
## Degrees of Freedom: 536 Total (i.e. Null); 528 Residual
## Null Deviance: 694.2
## Residual Deviance: 509.8 AIC: 527.8
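The finalized workflow behaves like any other fitted workflow, so (as a sketch, reusing the test objects from earlier) we could evaluate it on the held-out data just as before:

```r
# Evaluate the finalized, refit workflow on the test set
predict(man_final, new_data = pm_testing) %>%
  bind_cols(pm_testing) %>%
  my_metrics(truth = diabetes, estimate = .pred_class, event_level = "second")
```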
We could set up a more extensive grid that automatically generates values according to a regular pattern. This will consume more CPU time, but it can investigate a larger parameter space.
tune_spec <-
rand_forest(mode = "classification") %>%
set_engine("randomForest", tree_depth = tune(), cost_complexity = tune())
# Setup a tuning grid which contains candidate values for the above hyperparameters
tree_grid <- grid_regular(cost_complexity(),tree_depth(),
levels = 10)
tree_grid
## # A tibble: 100 × 2
## cost_complexity tree_depth
## <dbl> <int>
## 1 0.0000000001 1
## 2 0.000000001 1
## 3 0.00000001 1
## 4 0.0000001 1
## 5 0.000001 1
## 6 0.00001 1
## 7 0.0001 1
## 8 0.001 1
## 9 0.01 1
## 10 0.1 1
## # … with 90 more rows
Now, we fit the model over a range of parameters after we update the workflow to use the new tuning specification:
res <- pm_workflow %>%
update_model(tune_spec) %>%
tune_grid(resamples = pm_folds,
grid = tree_grid,
metrics = yardstick::metric_set(yardstick::precision))
res %>% collect_metrics()
## # A tibble: 100 × 8
## tree_depth cost_complexity .metric .estimator mean n std_err .config
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 0.0000000001 precision binary 0.780 6 0.0251 Preproce…
## 2 1 0.000000001 precision binary 0.777 6 0.0253 Preproce…
## 3 1 0.00000001 precision binary 0.787 6 0.0269 Preproce…
## 4 1 0.0000001 precision binary 0.776 6 0.0271 Preproce…
## 5 1 0.000001 precision binary 0.779 6 0.0226 Preproce…
## 6 1 0.00001 precision binary 0.786 6 0.0271 Preproce…
## 7 1 0.0001 precision binary 0.783 6 0.0279 Preproce…
## 8 1 0.001 precision binary 0.777 6 0.0281 Preproce…
## 9 1 0.01 precision binary 0.788 6 0.0291 Preproce…
## 10 1 0.1 precision binary 0.775 6 0.0263 Preproce…
## # … with 90 more rows
best_params <-
res %>%
tune::select_best(metric = "precision")
best_params
## # A tibble: 1 × 3
## tree_depth cost_complexity .config
## <int> <dbl> <chr>
## 1 13 0.0000001 Preprocessor1_Model084
res %>%
collect_metrics() %>%
mutate(tree_depth = factor(tree_depth)) %>%
ggplot(aes(cost_complexity, mean, color = tree_depth)) +
geom_line(size = 1.5, alpha = 0.6) +
geom_point(size = 2) +
facet_wrap(~ .metric, scales = "free", nrow = 2) +
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option = "plasma", begin = .9, end = 0)
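To close the loop (a sketch reusing the objects above), we could finalize the tuned workflow with best_params and refit it on the full training set, just as we did earlier with man_final:

```r
# Plug the best precision-tuned parameters back into the workflow and refit
final_rf <- pm_workflow %>%
  update_model(tune_spec) %>%
  finalize_workflow(best_params) %>%
  fit(data = pm_training)
```

From there, predict() on the test set would give us a final held-out estimate of performance.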